An Information-Theoretic Approach to Network Modularity 



Etay Ziv†

College of Physicians & Surgeons,
Department of Biomedical Engineering,
Columbia University

Manuel Middendorf†

Department of Physics,
Columbia University

Chris Wiggins

Department of Applied Physics and Applied Mathematics,
Center for Computational Biology and Bioinformatics,
Columbia University

† These authors contributed equally to this work.

Exploiting recent developments in information theory, we propose, illustrate, and validate a principled
information-theoretic algorithm for module discovery and a resulting measure of network modularity.
This measure is an order parameter (a dimensionless number between 0 and 1). Comparison
is made to other approaches to module discovery and to quantifying network modularity using Monte
Carlo generated Erdős-like modular networks. Finally, the Network Information Bottleneck (NIB)
algorithm is applied to a number of real-world networks, including the "social" network of coauthors
at the APS March Meeting 2004.

PACS numbers: 89.75.Fb, 87.23.Ge, 87.10.+e, 05.10.-a

1. INTRODUCTION

A goal of modeling is to describe the system in terms of less complex degrees of freedom while retaining information
deemed relevant [?]. Approaches to complexity reduction for networks include characterizing the network in terms of
simple statistics (such as degree distributions [3] or clustering coefficients [?]), subgraphs over-represented relative to
an assumed null model (see, for example, [?, ?, ?]), and communities (see, for example, [?]).

In the case of communities, networks are coarse-grained into clusters of nodes, or modules, where nodes belonging 
to one cluster are highly interconnected, yet have relatively few connections to nodes in other clusters. This type of 
network complexity reduction may be particularly promising as an approach to network analysis, since many
naturally-occurring networks, including biological [8] and sociological [9] networks, are thought to be modular. Clearer,
quantitative understanding of these ideas would be valuable in finding reduced complexity descriptions of networks, 
in visualizing networks, and in revealing global design principles. Two current challenges facing the community 
regarding network modularity include (i) the ability to quantify to what extent a given network is "modular" and (ii) 
the ability to identify the modules of a given network. 

With regard to quantifying modularity, to our knowledge no mathematical definition has yet been proposed for 
a measure of modularity that could compare networks regardless of size, origin, or choice of partitioning. In their 
recent book, Schlosser and Wagner [10] write "a generally accepted definition of a module does not exist and different
authors use the concept in quite different ways." They proceed to warn of the "danger that modularity will degenerate 
into a fashionable but empty phrase unless its precise meaning is specified." Some steps in this direction have been 
suggested by Newman's "assortativity coefficient" [11], which quantifies the level of assortative mixing in a network,
and its unnormalized form, called "modularity" [7]. However, these measures quantify the quality of a particular
partitioning of the network for a given number of modules, but are not a property of the network itself that could 
serve to compare networks of different origins. 

As for module discovery, a range of techniques for identifying the modules in a network have been used with
varying success. In his review article, Newman [12] summarizes these efforts under the broad category of hierarchical
clustering, in which one poses a similarity metric between pairs of vertices. Hierarchical clustering can be agglomerative,
where the most similar nodes are iteratively merged (e.g., [13]), or divisive, where edges between the least similar nodes
are iteratively removed (e.g., [?]). By modifying traditional divisive approaches to focus on the most "between" edges
rather than the least similar vertices, Newman and Girvan recently proposed a new class of divisive algorithms for
finding modules. Various measures of "edge-betweenness" are defined to identify edges that lie between communities
rather than within communities. By iteratively removing edges with highest betweenness one can break down the 
network into disconnected components which define the modules. 

In this manuscript we take a markedly different approach. The problem of finding reduced descriptions of systems 
while retaining information deemed relevant has been well-studied in the learning theory community. In particular, 
the information bottleneck [?, 14] provides a unified and principled framework for complexity reduction. By applying
the information bottleneck on probability distributions defined by graph diffusion, we propose a new, principled, 
information-theoretic algorithm to identify modules in networks. We demonstrate that the Network Information 
Bottleneck (NIB) algorithm outperforms the currently used technique of edge-betweenness (i) in correctly assigning 
nodes to modules and (ii) in determining the optimal number of existing modules. Moreover, the new method naturally
defines a network modularity measure that can compare any two undirected networks by the extent to which the
topology of each can be summarized by modules over all scales. Information-theoretic bounds constrain this measure
to be between 0 and 1. Finally, we apply our method to a collaboration network derived from the APS March Meeting
2004 and the E. coli genetic regulatory network. 



2. THE INFORMATION BOTTLENECK: A REVIEW 

Brief [1, ?] and detailed [?] discussions of the information bottleneck can be found elsewhere; we here review
only the most salient features. The fundamental quantity in information theory is the Shannon entropy
$H[p(x)] = -\sum_x p(x) \log p(x)$, measuring lack of information (or disorder) in a random variable X, and uniquely
(up to a constant) defined by three plausible axioms [15]. Knowledge of a second random variable Y decreases the entropy in X
on average by an amount

\[ I(X, Y) = H(X) - \langle H(X|Y) \rangle = H[p(x)] - \sum_y p(y)\, H[p(x|y)] \tag{1} \]

called the mutual information [16], the average information gained about X by the knowledge of Y. Eq. (1) is
equivalent to

\[ I(X, Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = \left\langle \log \frac{p(x, y)}{p(x)\, p(y)} \right\rangle \tag{2} \]

revealing its symmetry in X and Y. The mutual information thus measures how much information one random 
variable tells about the other, and is the basis of the information bottleneck. 
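
As a small numerical illustration of Eqs. (1) and (2), the following sketch computes the mutual information of a joint distribution both ways and checks that the two forms agree. The function names and the 2 x 2 joint table are hypothetical choices made for this example only, and logarithms are taken in base 2 (bits).

```python
from math import log2


def entropy(p):
    """Shannon entropy H[p] = -sum_x p(x) log2 p(x), in bits."""
    return -sum(px * log2(px) for px in p if px > 0)


def mutual_information(joint):
    """I(X, Y) from a joint table joint[x][y], using the form of Eq. (2)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


def mutual_information_eq1(joint):
    """I(X, Y) = H(X) - sum_y p(y) H[p(x|y)], the form of Eq. (1)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    conditional = sum(
        py[j] * entropy([row[j] / py[j] for row in joint])
        for j in range(len(py))
        if py[j] > 0
    )
    return entropy(px) - conditional


# Hypothetical joint distribution over two binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(mutual_information(joint), mutual_information_eq1(joint))
```

Transposing the table (swapping the roles of X and Y) leaves the result unchanged, which is the symmetry noted in the text.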

Clustering can generally be described as the problem of extracting a compressed description of some data that 
captures information deemed relevant or meaningful. For example, we might want to cluster protein sequences, 
expecting that the cluster assignments contain information about the fold of the proteins; or we might want to cluster 
words in documents, expecting that the clusters capture information about the topic in which the words appear. 
Tishby et al.'s [?] key insight into this problem is the inclusion in the clustering algorithm of another random variable,
called the relevance variable, which describes the information to be preserved. In the case of protein sequences, the 
relevance variable might be the protein fold; in the case of clustering words over documents, the relevance variable 
might be the topic. 

Let $x \in X$ be the input random variable (e.g., protein sequences in the set of all observed sequences, or words
in a given dictionary, in the two examples above), $y \in Y$ the relevance variable, and $z \in Z$ the cluster assignment
random variable [31]. The information bottleneck outputs a probabilistic cluster assignment function p(z|x) equal to
the probability to be in cluster z for a given input x. The clustering minimizes the mutual information between X
and Z ("maximally compressing the data set"), while constraining the possible loss in mutual information between 
Z and Y ("preserving relevant information"). In other words, one seeks to pass or squeeze the information that X 
provides about Y through the "bottleneck" formed by the compressed Z. 

The simplicity of the model Z relative to that of the world X is quantified by the entropy reduction
$\mathcal{S}[p(z|x)] = H(X) - I(X, Z) = H(X, Z) - H(Z) \geq 0$. The gain in simplicity, however, comes with a loss of fidelity in our
description of the world, quantified by the error $\mathcal{E} = I(X, Y) - I(Z, Y) \geq 0$, the loss in information about the world
when described by a model Z instead of the primitive description X. The trade-off between the error and the simplicity
can be expressed in terms of the functional

\[ \mathcal{F}[p(z|x)] = \mathcal{E} - T\mathcal{S} = \mathcal{F}_0 - I(Z, Y) + T\, I(X, Z) \tag{3} \]

in which the temperature T parameterizes the relative importance of simplicity over fidelity. The term $\mathcal{F}_0$ is
independent of the cluster assignment p(z|x). Since p(y|x, z) = p(y|x), this is the only degree of freedom over which the
free energy $\mathcal{F}$ is to be minimized. In the annealed ground state ($T \to 0$) each possible state of the world $x \in X$
is assigned with unit probability to one and only one state of the model $z \in Z$ (i.e., $p(z|x) \in \{0, 1\}$, a limit called
"hard clustering"). If the cardinalities |Z| and |X| are equal, we arrive at the fully detailed, trivial solution where the
clusters Z simply copy the original X. A formal solution to the information bottleneck problem is given in [?] and
yields the following three self-consistent equations (with $\beta = 1/T$),

\[
\begin{cases}
p(z|x) = \dfrac{p(z)}{Z(\beta, x)}\, e^{-\beta D_{KL}[p(y|x) \,\|\, p(y|z)]} \\[4pt]
p(z) = \sum_x p(z|x)\, p(x) \\[4pt]
p(y|z) = \dfrac{1}{p(z)} \sum_x p(y|x)\, p(z|x)\, p(x)
\end{cases} \tag{4}
\]

where $Z(\beta, x)$ is a normalization (partition) function and $D_{KL}[p \| q] = \sum_x p(x) \log \frac{p(x)}{q(x)}$ is the Kullback-Leibler
divergence (also called the relative entropy). The first of these equations makes clear that as one anneals to the ground state,
where $T \to 0$ and $\beta \to \infty$, the only solution is the hard clustering ($p(z|x) \in \{0, 1\}$) limit. These three equations
naturally lend themselves to an iterative algorithm proposed in [?] which is provably convergent and finds a locally
optimal solution. 
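
To make the structure of the three self-consistent equations concrete, the following is a minimal sketch of one update sweep at fixed $\beta$, written with plain Python lists. It is an illustration under stated assumptions rather than the implementation used for the results in this paper: the function names, the toy input, the random initialization, and the choice $\beta = 20$ are all hypothetical, and the weights are normalized in log space purely for numerical safety.

```python
from math import exp, log
import random


def kl(p, q, eps=1e-300):
    """Kullback-Leibler divergence D_KL[p || q] in nats (q floored at eps)."""
    return sum(pi * log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def ib_step(p_y_x, p_x, p_z_x, beta):
    """One sweep over the three self-consistent equations at fixed beta.

    p_y_x[x][y] = p(y|x), p_x[x] = p(x), p_z_x[x][z] = current p(z|x).
    Returns the updated p(z|x).
    """
    n_x, n_y, n_z = len(p_x), len(p_y_x[0]), len(p_z_x[0])
    # p(z) = sum_x p(z|x) p(x)
    p_z = [sum(p_z_x[x][z] * p_x[x] for x in range(n_x)) for z in range(n_z)]
    # p(y|z) = (1 / p(z)) sum_x p(y|x) p(z|x) p(x); empty clusters get a flat row.
    p_y_z = []
    for z in range(n_z):
        if p_z[z] > 0:
            p_y_z.append([
                sum(p_y_x[x][y] * p_z_x[x][z] * p_x[x] for x in range(n_x)) / p_z[z]
                for y in range(n_y)
            ])
        else:
            p_y_z.append([1.0 / n_y] * n_y)
    # p(z|x) proportional to p(z) exp(-beta D_KL[p(y|x) || p(y|z)])
    updated = []
    for x in range(n_x):
        logw = [
            log(p_z[z]) - beta * kl(p_y_x[x], p_y_z[z]) if p_z[z] > 0 else None
            for z in range(n_z)
        ]
        top = max(v for v in logw if v is not None)
        w = [exp(v - top) if v is not None else 0.0 for v in logw]
        total = sum(w)  # plays the role of the partition function Z(beta, x)
        updated.append([wi / total for wi in w])
    return updated


# Hypothetical toy input: four inputs x, two relevance values y, two clusters z.
p_y_x = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
p_x = [0.25] * 4
rng = random.Random(0)
p_z_x = []
for _ in p_x:
    row = [rng.random() + 0.5 for _ in range(2)]
    p_z_x.append([r / sum(row) for r in row])
for _ in range(100):
    p_z_x = ib_step(p_y_x, p_x, p_z_x, beta=20.0)
```

An annealing scheme would wrap the inner loop in an outer loop over increasing values of `beta`, reusing the previous `p_z_x` as the starting point.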

While in many applications a "soft" clustering might be of interest, for clarity we only consider the hard case in this 
paper: each node is associated with one and only one module. We use two different algorithms to find approximate 
solutions to the information bottleneck problem. Both of them take a fixed |Z| (|Z| < |X|) as input and output hard
clustering assignments for every node. 

The first algorithm (self-consistent NIB) uses $\beta$ as an annealing parameter that starts at low values and increases
step by step. At every given $\beta$ the locally optimal solution is computed by iterating over Equations (4). The solution
for a given $\beta$ is then taken as the starting point for the iterations with the next $\beta$. The second algorithm (agglomerative
NIB) uses an agglomerative approach [17]. At every step a pair of nodes is merged into a single node, where the pair
is chosen so as to maximize the relevant information I(Y, Z). It thus reduces |Z| by one at every step, and stops
when the desired |Z| is reached.
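
The agglomerative procedure can be sketched as a greedy loop that, at each step, tries every pair of current clusters and keeps the merge that leaves I(Y, Z) largest. The version below is a deliberately naive illustration (it recomputes the mutual information for every candidate pair, which is far from efficient); the function names and the toy joint table are hypothetical and are not taken from the implementation used in this paper.

```python
from math import log


def mutual_information(joint):
    """I(Z, Y) in nats from a joint table joint[z][y]."""
    pz = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        v * log(v / (pz[i] * py[j]))
        for i, row in enumerate(joint)
        for j, v in enumerate(row)
        if v > 0
    )


def agglomerate(joint_xy, n_clusters):
    """Greedy hard clustering: joint_xy[x][y] = p(x, y).

    Starts with every x in its own cluster and repeatedly merges the pair
    of clusters whose merge retains the most information I(Z, Y).
    Returns the clusters as lists of x indices.
    """
    rows = [list(r) for r in joint_xy]          # p(z, y), one row per cluster
    members = [[x] for x in range(len(rows))]
    while len(rows) > n_clusters:
        best = None
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                merged = [a + b for a, b in zip(rows[i], rows[j])]
                trial = [r for k, r in enumerate(rows) if k not in (i, j)] + [merged]
                info = mutual_information(trial)
                if best is None or info > best[0]:
                    best = (info, i, j)
        _, i, j = best
        rows[i] = [a + b for a, b in zip(rows[i], rows[j])]
        members[i] += members[j]
        del rows[j], members[j]
    return members


# Hypothetical joint p(x, y): x0, x1 favor y0, while x2, x3 favor y1.
joint_xy = [[0.225, 0.025], [0.2, 0.05], [0.025, 0.225], [0.05, 0.2]]
clusters = agglomerate(joint_xy, 2)
```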



3. DIFFUSIVE DISTRIBUTIONS DEFINED OVER GRAPHS 



We wish to find a representation of a network in which groups of nodes have been represented by effective nodes; 
we argue that a modular description of the network is most successful when relevant information about the network 
is preserved. Posed in this language, it is clear that the act of finding modules in a network is a type of clustering, 
and the appropriate clustering framework is one that preserves the information deemed relevant. 

Formulation of graph clustering in terms of the information bottleneck requires a joint distribution p(y, x) to be 
defined on the graph, where x designates nodes and y designates a relevance variable. An appropriate distribution 
that captures structural information about the network is the one defined by graph diffusion. The relevance random 
variable y then ranges over the nodes, as does x, and is defined by the node at which a random walker stands at 
a given time t if the random walker was standing at node x at time 0. The conditional probability distribution 
$G^t_{ij} = p_t(y_i|x_j)$ is a Green's function describing propagation from node j to node i. For discrete time diffusion one
can easily derive [?]

\[ G^t = [W T^{-1}]^t, \quad (t \in \mathbb{N}) \tag{5} \]

where W is a symmetric weighted affinity matrix of positive entries and $T_{ij} = \delta_{ij} \sum_l W_{il} = \delta_{ij} k_i$. For a graph with
identically weighted edges, $k_i$ is the conventional degree (the number of neighbors of node i), and W is the adjacency
matrix ($W_{ij} = 1$ iff i is adjacent to j). Note that we here only consider connected graphs and, as defined, this approach
treats directed and undirected graphs identically. 
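
As an illustration of the discrete-time propagator, the sketch below builds $W T^{-1}$ for a small graph and raises it to the power t by repeated multiplication. Each column j of the result is the distribution of a walker that started at node j. The helper names and the three-node path graph are hypothetical, and plain lists are used only to keep the example free of dependencies.

```python
def transition_matrix(W):
    """One-step propagator W T^{-1}: entry [i][j] = W_ij / k_j."""
    k = [sum(row) for row in W]  # degrees (W is symmetric)
    n = len(W)
    return [[W[i][j] / k[j] for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Plain-list matrix product A B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def green_discrete(W, t):
    """G^t = (W T^{-1})^t; column j holds p_t(y | x = j)."""
    n = len(W)
    G = [[float(i == j) for j in range(n)] for i in range(n)]
    P = transition_matrix(W)
    for _ in range(t):
        G = matmul(P, G)
    return G


# Hypothetical example: a path graph 0 - 1 - 2.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
G2 = green_discrete(W, 2)
# A walker starting at node 0 is, after two steps, at node 0 or node 2.
print([row[0] for row in G2])
```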
In the continuous time limit 

\[ G^t = e^{(W T^{-1} - \mathbb{1})t} = e^{-L T^{-1} t}, \quad (t > 0) \tag{6} \]

where we defined $L = T - W$, the graph Laplacian [?]. In the machine learning literature a "graph kernel" [20] has
been defined as

\[ G^t = e^{-Lt} \tag{7} \]

to learn from structured data. It corresponds to a probability distribution associated with a different diffusion rule 
assuming a degree-dependent permeability at every node. For comparison, we consider both of these joint distributions 
as possible input to the information bottleneck algorithm.

The characteristic time scale $\tau$ of the system is given by the inverse of the smallest non-zero eigenvalue of the
diffusion operator exponent ($L T^{-1}$ or L). This time reflects the finite system size and characterizes large-scale
behaviors. For example, in one dimension on a bounded domain of size $\ell$, the smallest non-zero eigenvalue of the
Laplacian with diffusion constant D is $\pi^2 D/\ell^2$. For our algorithm we will thus choose $t = \tau$.
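
One way to estimate $\tau$ for the operator $L T^{-1}$ without a linear algebra library is sketched below. It relies on the fact that $L T^{-1}$ is similar to the symmetric matrix $N = \mathbb{1} - T^{-1/2} W T^{-1/2}$, whose null vector is proportional to $\sqrt{k_i}$, and runs a power iteration on $2\,\mathbb{1} - N$ with that vector projected out. This numerical scheme, the function names, and the iteration count are assumptions made for illustration; the graph is assumed connected, and any standard eigenvalue solver would serve equally well.

```python
from math import sqrt
import random


def smallest_nonzero_eigenvalue(W, n_iter=2000, seed=0):
    """Smallest non-zero eigenvalue of L T^{-1} for a connected graph.

    Power iteration on 2*1 - N, N = 1 - T^{-1/2} W T^{-1/2}, with the known
    top eigenvector sqrt(k) deflated, converges to 2 - lambda_2.
    """
    n = len(W)
    k = [sum(row) for row in W]
    norm_k = sqrt(sum(k))
    u = [sqrt(ki) / norm_k for ki in k]  # unit null vector of N

    def apply(v):
        # (2*1 - N) v = v + T^{-1/2} W T^{-1/2} v
        return [
            v[i] + sum(W[i][j] * v[j] / sqrt(k[i] * k[j]) for j in range(n))
            for i in range(n)
        ]

    def deflate(v):
        dot = sum(a * b for a, b in zip(u, v))
        return [a - dot * b for a, b in zip(v, u)]

    rng = random.Random(seed)
    v = deflate([rng.random() for _ in range(n)])
    for _ in range(n_iter):
        v = deflate(apply(v))
        norm = sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    rayleigh = sum(a * b for a, b in zip(v, apply(v)))  # approx. 2 - lambda_2
    return 2.0 - rayleigh


def characteristic_time(W):
    """tau = 1 / (smallest non-zero eigenvalue of L T^{-1})."""
    return 1.0 / smallest_nonzero_eigenvalue(W)


# Hypothetical examples: a three-node path and the complete graph on four nodes.
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
k4 = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
print(characteristic_time(path3), characteristic_time(k4))
```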

To calculate the joint probability distribution $p(y, x) = p(y|x)\, p(x)$ from the conditional probability distribution
$G^\tau = p(y|x)$, we must specify a prior p(x): the distribution of random walkers at time 0. Natural definitions include
(i) a flat prior $p(x) = 1/N$, N being the total number of nodes, and (ii) a prior corresponding to the steady-state
distribution associated with the diffusion operator: $p(x) = 1/N$ or $p(x) = k_x / \sum_{x'} k_{x'}$, for $G^\tau = e^{-L\tau}$ or $G^\tau = e^{-L T^{-1} \tau}$,
respectively, where $k_x$ is the degree of node x.
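
The construction of the joint distribution from a propagator and a prior can be sketched as follows. For simplicity the example uses the one-step discrete propagator $W T^{-1}$ on a hypothetical three-node path graph in place of $G^\tau$; the function names are likewise illustrative. With the degree-proportional prior, the marginal over y reproduces the prior, which is what makes it the steady state of this diffusion rule.

```python
def flat_prior(n):
    """p(x) = 1/N."""
    return [1.0 / n] * n


def degree_prior(W):
    """p(x) = k_x / sum_x' k_x'."""
    k = [sum(row) for row in W]
    total = sum(k)
    return [kx / total for kx in k]


def joint_distribution(G, prior):
    """p(y, x) = p(y|x) p(x), with G[y][x] = p(y|x) (columns sum to 1)."""
    n = len(prior)
    return [[G[y][x] * prior[x] for x in range(n)] for y in range(n)]


# Hypothetical example: path graph 0 - 1 - 2 and its one-step propagator W T^{-1}.
W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
G = [[0.0, 0.5, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
joint = joint_distribution(G, degree_prior(W))
p_y = [sum(row) for row in joint]  # marginal over the starting node x
print(p_y)
```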

4. QUANTIFYING MODULARITY 

4.1. Partition modularity — quality of a partitioning 

Newman and Girvan propose a modularity, a "measure of a particular division of a network", as
$Q = \sum_i \left[ e_{ii} - \left(\sum_j e_{ji}\right)\left(\sum_k e_{ik}\right) \right]$, where $e_{ij}$ is the fraction of all edges connecting module i and module j. It can be interpreted as
the difference between the fraction of within-module edges and the expected fraction of within-module edges in an
ensemble of networks created by randomizing all connections while holding constant the number of edges emanating
from each module. Q should therefore go to 0 for randomly connected networks, and tend to $1 - 1/|Z|$ for a perfectly
modular network with |Z| equally sized modules. We herein refer to the measure Q as partition modularity to
distinguish it from network modularity, which we define below based on information-theoretic quantities. Newman et
al. also study the number of modules $|Z|_{max}$ which maximizes Q given a particular module discovery algorithm.
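
For concreteness, the partition modularity Q of a hard partitioning of an undirected, unweighted graph can be computed as in the sketch below. The normalization is an assumption of this example: each edge between two different modules is split equally between $e_{ij}$ and $e_{ji}$, so that the matrix e is symmetric and sums to 1. The function name and the two-triangle test graph are hypothetical.

```python
def partition_modularity(edges, labels):
    """Q = sum_i [e_ii - (sum_j e_ij)^2] for an undirected edge list.

    edges  : list of (u, v) node pairs, each undirected edge listed once
    labels : labels[u] is the module index (0, 1, ...) of node u
    """
    n_modules = max(labels) + 1
    m = len(edges)
    e = [[0.0] * n_modules for _ in range(n_modules)]
    for u, v in edges:
        # Each edge contributes 1/(2m) to e[i][j] and 1/(2m) to e[j][i].
        e[labels[u]][labels[v]] += 0.5 / m
        e[labels[v]][labels[u]] += 0.5 / m
    return sum(e[i][i] - sum(e[i]) ** 2 for i in range(n_modules))


# Hypothetical example: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(partition_modularity(edges, [0, 0, 0, 1, 1, 1]))  # 5/14
print(partition_modularity(edges, [0, 0, 0, 0, 0, 0]))  # 0 for a single module
```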

4.2. Network modularity — summarizability of network structure 

We here propose a new modularity measure M, a property of a given network rather than of a given partitioning, 
which quantifies the extent to which a network can be summarized in terms of modules. 

Every clustering solution p(z|x) determines a normalized "input information" $0 \leq I(Z, X)/H(X) \leq 1$ between
input variable X and cluster assignment Z, and an "output information" $0 \leq I(Z, Y)/I(X, Y) \leq 1$ between cluster
assignment and relevance variable Y. The information curve is then plotted as I(Z, Y)/I(X, Y) vs. I(Z, X)/H(X)
for every solution of Eq. (4) for every possible number of clusters. An example is shown in Figure [?]. The
curve traced by minimizers of the functional in Eq. (3), which will not necessarily be computed by such approximating
schemes as the AIB, is provably concave. For perfectly random data, which cannot be summarized, this curve lies
along the diagonal y = x. Consistent with these observations, we find that synthetic graphs with high connectivity
within defined modules and low connectivity between different modules exhibit larger area under the information
curve (data not shown). We thus define a new measure of network modularity: the area under the information curve.
In the soft clustering case the information curve is continuous, since solutions vary with every choice of $\beta \in [0, \infty)$. In
the hard clustering case, which we study here, the information curve is only defined at discrete points corresponding
to solutions for every possible number of clusters |Z|. The area can then be calculated by linear interpolation.
Information-theoretic bounds constrain the range of M, allowing comparison of networks with varying numbers of nodes
and edges; M is a property of the network itself, rather than of a given partitioning.
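
The area computation in the hard clustering case can be sketched as follows: each partitioning contributes one point of the information plane, and the area is obtained by the trapezoid rule over the sorted points. The function names and the toy inputs are hypothetical. The sketch uses the fact that for a hard clustering Z is a deterministic function of X, so that I(Z, X) = H(Z).

```python
from math import log


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(v * log(v) for v in p if v > 0)


def mutual_information(joint):
    """Mutual information in nats from a joint table joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(
        v * log(v / (pa[i] * pb[j]))
        for i, row in enumerate(joint)
        for j, v in enumerate(row)
        if v > 0
    )


def information_point(joint_xy, labels):
    """(I(Z,X)/H(X), I(Z,Y)/I(X,Y)) for a hard clustering labels[x] = z."""
    n_z, n_y = max(labels) + 1, len(joint_xy[0])
    joint_zy = [[0.0] * n_y for _ in range(n_z)]
    for x, row in enumerate(joint_xy):
        for y, v in enumerate(row):
            joint_zy[labels[x]][y] += v
    p_x = [sum(row) for row in joint_xy]
    p_z = [sum(row) for row in joint_zy]
    # Hard clustering: I(Z, X) = H(Z).
    return (entropy(p_z) / entropy(p_x),
            mutual_information(joint_zy) / mutual_information(joint_xy))


def network_modularity(points):
    """Area under the piecewise-linear information curve (trapezoid rule)."""
    pts = sorted(points)
    return sum(0.5 * (y0 + y1) * (x1 - x0)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Points on the diagonal give area 1/2; a curve bulging toward (0, 1) gives more.
print(network_modularity([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]))
print(network_modularity([(0.0, 0.0), (0.2, 0.9), (1.0, 1.0)]))
```

In use, one would collect `information_point` for the partitioning found at every |Z|, from |Z| = 1 (the origin) to |Z| = |X| (the point (1, 1)), and pass the list to `network_modularity`.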

5. TESTS ON SYNTHETIC NETWORKS 
5.1. Accuracy of the partitioning 

We here test how well various NIB implementations with different diffusion operators can reconstruct modules 
in a network generated with a known modular structure. We also compare our method to the "edge-betweenness" 
algorithm recently proposed in [?] for the same purpose of finding modules or "communities". In [?] the network is
broken down into isolated components by iteratively removing edges with highest "betweenness" (several definitions
of edge-betweenness are tested in [?]; we here use the "shortest path" betweenness, which was shown in [?] to perform
optimally).

As in [?], we generate synthetic Erdős-like graphs via Monte Carlo with 128 nodes each and average degree 8
(average total of 512 edges). We also demand that the graphs be connected by rejecting generated graphs that
have disconnected components. We impose a structure of 4 modules with 32 nodes each by introducing two different
probabilities: $p_{in}$ for edges inside modules and $p_{out}$ for edges between different modules. The level of noise in the graph
is thus controlled by $p_{out}$. The higher $p_{out}$, the harder it will be to recover the different modules. We first generate
networks with $p_{out} = 0$ and then increase $p_{out}$ while adjusting $p_{in}$ such that the average degree remains fixed. When
$p_{in} = p_{out}$, all modular structure is lost and we obtain a usual Erdős graph. We measure the accuracy of a proposed
partitioning using the following computation. In principle any module proposed by the algorithm could match any 
"true" module with an associated error. We therefore try every possible permutation of the 4 proposed modules 
matching the 4 "true" modules, and consider the one permutation with the smallest total number of incorrectly 
assigned nodes. We define accuracy as the total fraction of correctly assigned nodes. 
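
The benchmark and the accuracy measure can be sketched as below. The relation used to fix the average degree is a derivation made for this example: a node has 31 possible neighbors inside its module and 96 outside, so the expected degree is $31\, p_{in} + 96\, p_{out}$, and for a given ratio $r = p_{out}/p_{in}$ one obtains $p_{in} = 8/(31 + 96 r)$. Function names, the seed, and the retry limit are hypothetical. Since a graph with $p_{out} = 0$ can never be connected, the sketch gives up after a fixed number of attempts instead of looping forever.

```python
import random
from collections import deque
from itertools import permutations


def is_connected(adj):
    """Breadth-first search from node 0 over adjacency sets."""
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)


def planted_partition(ratio, n_modules=4, size=32, mean_degree=8.0,
                      seed=0, max_tries=1000):
    """Connected random graph with planted modules; ratio = p_out / p_in."""
    rng = random.Random(seed)
    n = n_modules * size
    p_in = mean_degree / ((size - 1) + (n - size) * ratio)
    p_out = ratio * p_in
    truth = [i // size for i in range(n)]
    for _ in range(max_tries):
        adj = [set() for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                p = p_in if truth[i] == truth[j] else p_out
                if rng.random() < p:
                    adj[i].add(j)
                    adj[j].add(i)
        if is_connected(adj):  # reject graphs with disconnected components
            return adj, truth
    raise RuntimeError("no connected graph found; ratio may be too small")


def accuracy(proposed, truth, n_modules=4):
    """Fraction of correctly assigned nodes under the best label permutation."""
    best = max(
        sum(perm[p] == t for p, t in zip(proposed, truth))
        for perm in permutations(range(n_modules))
    )
    return best / len(truth)


adj, truth = planted_partition(0.3, seed=1)
print(sum(len(a) for a in adj) / len(adj))  # realized average degree
```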

Figure 1(a) shows the accuracy of the recovered modules as a function of $p_{out}/p_{in}$ for three different algorithms:
self-consistent NIB, agglomerative NIB, and betweenness. Both NIB algorithms use the physical diffusion operator
$e^{-L T^{-1} t}$ and a flat prior $1/N$ to define a joint probability distribution. We observe that both NIB algorithms are
much more successful in recovering the modular structure than the betweenness algorithm. A threshold noise level is
reached at around $p_{out}/p_{in} \approx 1/3$ for the NIB algorithms, and around $p_{out}/p_{in} \approx 1/6$ for the betweenness algorithm.
The figure also shows that the self-consistent NIB in general finds a better partitioning than the agglomerative NIB. 

Figure 1(b) shows the same measurements for self-consistent NIB algorithms using different diffusion operators as
explained in Section 3. For comparison the betweenness results are also plotted. Physical diffusion, defined by
the continuous time limit, with an initial state p(x) given by the equilibrium distribution $p(x) \propto k_x$, gives the best
performance.

5.2. Finding the optimal number of modules 

In most real-world problems the correct number of modules |Z| present in the network is unknown a priori. It is
therefore important to have an algorithm which not only computes a good partitioning for a given |Z| but also gives
a good estimate for |Z| itself. To this end, we here make use of the partition modularity Q as described in Section 4.1.

We again consider synthetic connected networks of 128 nodes and average degree 8 as in the previous section. 
However, we fix the noise level to a value of $p_{out}/p_{in} = 0.3$, which was shown to be a critical level for these networks.
We run the self-consistent NIB and the betweenness algorithms for every possible number of modules $|Z| = 1, 2, \ldots, 128$
and compute Q for the proposed partitionings. Figure 2(a) shows Q as a function of |Z| for a typical run. While for
the NIB algorithm Q sharply peaks at the correct value of $|Z|_{max} = 4$, Q calculated by the betweenness algorithm
attains its maximum at $|Z|_{max} = 46$ and does not show a particular signal at |Z| = 4. Figure 2(b) shows a histogram
of $|Z|_{max}$ for 100 generated networks. The NIB algorithm successfully identifies $|Z|_{max} = 4$ for 82% of the networks,
while the betweenness algorithm calculates $|Z|_{max}$ lying between 18 and 89, notably far from the correct value for
any network. These experiments suggest that the NIB algorithm performs well both in accurately assigning nodes to 
modules and in revealing the optimal scale for partitioning. 

6. APPLICATIONS 
6.1. Collaboration Network 

Having validated NIB on a toy model of modular networks, we next apply our algorithm to two examples of 
naturally-occurring networks. In the first example, we construct a collaboration network from the 2004 APS March 
Meeting, where this algorithm was first presented, and in the second example we construct a symmetric version of 
the E. coli genetic regulatory network. 

Vertices of the collaboration network represent authors from all talks at the March Meeting; edges represent 
coauthorship. The largest component of the resulting graph consists of 5603 vertices with 19761 edges. Network 
information bottleneck using the agglomerative algorithm and the physical diffusion operator (as defined in Section
3, with its corresponding equilibrium distribution) reveals that this large network is highly modular (M = 0.9775, see
Figure 3(a)). For comparison, we also show the information curve for a typical Erdős network, which is clearly less
modular. Such a high value of modularity implies that the authors of this component of the network are "easily" 
compressed or combined into larger clusters of authors. In the light of this fact, we study what the clusters of 
authors reveal about the collaboration network. For example, authors may group themselves according to topics or 
subject matters of the talks; alternatively, author modules may be more indicative of the authors' affiliations or even 
geographical location.

FIG. 1: (a) Accuracy for different algorithms. Measured is the accuracy of recovering modular structure in synthetic
networks under varying noise levels. Every point represents an average over 100 networks, each with 128 nodes and an average
degree of 8. Both NIB algorithms outperform the betweenness algorithm.

(b) Accuracy for different diffusion operators. Accuracy is measured in the same way as in (a), now using the
self-consistent NIB algorithm with various diffusion operators to define probability distributions p(y, x) over nodes. For comparison



FIG. 2: (a) Q vs. |Z|. The quality of the partitioning Q is computed for the partitionings output by the self-consistent NIB
algorithm (with physical diffusion operator), and the betweenness algorithm, for every given number of modules |Z|. While the
NIB algorithm correctly determines |Z| = 4 as the best number of modules (at maximum Q), Q calculated by the betweenness
algorithm peaks at |Z| = 46.

(b) Histogram of $|Z|_{max}$ for 100 different networks. For 82% of the networks, the NIB algorithm is able to find the

To begin to approach these types of questions we may choose to look at the author groupings given by NIB at a
particular number of clusters. While we emphasize that network modularity is a measure over all scales or all numbers
of clusters, it is illustrative in this case also to examine the clustering at a particular scale. For the APS network,
such an analysis yields the optimal number of modules, $|Z|_{max} = 115$, for this network (see Figure 3).

In Figure 4 we plot the 115 modules and their connections, where each ellipse represents one module and edges
between ellipses represent inter-modular connections. The sizes of the ellipses and the thickness of the edges are 
proportional to the log of the number of authors in a module and the log of the number of inter-modular connections 
between modules, respectively. We note the provocative structure revealed in the figure with a large center of highly 
connected modules (including two of the largest modules), three more or less branching, linear chains of modules, and 
one large 18-node cycle of modules. 

Closer inspection of a single module demonstrates that for many of the modules, institutional affiliation, and even 
geography, play a large role in determining collaborations. In Figure 5, a single 17-node module is plotted where
each node now represents an author and edges represent author collaborations. We see that 15 of the 17 authors
are affiliated with Columbia University; the remaining two authors are affiliated with Stony Brook, and notably, are
adjacent (indicating coauthorship) to each other. The finding that the modules in this collaboration network are
somewhat related to institutional affiliations and geography is supported by similar results found in other physics
collaboration networks previously studied using different techniques [?].

Another possible annotation for this module to consider is that of the APS divisions and topical groups, since 
each author is associated with at least one talk and each talk is listed under one or more of these APS categories. 
However, the 14 APS divisions and 10 topical groups appear to be too broad and have too much overlap to clearly 
define a module. For example, the Columbia University module includes talks under the categories of Polymer, 
Condensed Matter, Material, and Chemical Physics. On the other hand, the module is essentially representative of 
researchers at the Columbia University Materials Research Science and Engineering Center (MRSEC) and in particular 
those interested in the synthesis of complex metal oxide nanocrystals. There is thus both topical and institutional 
information retained in the modules. 

It is also revealing to examine the affiliations of multiple connected modules. For example, Figure 6 plots the
uppermost branching linear chain of Figure 4. Here, color denotes module assignment as given by NIB. Most of
these modules also have clear institutional affiliations. For example, everyone in the cyan module is at the Center
of Complex Systems Research (CCSR) in Illinois; close to 80% of the large green module is in China, mostly at the
Institute of Chemistry Chinese Academy of Sciences (ICCAS); and 70% of the red module is in England. The blue
module is slightly more diffuse, though an institutional affiliation is also apparent here; over 50% of the authors are
affiliated with one of three institutions near Chicago (Argonne National Labs, University of Illinois at Chicago, and
University of Notre Dame). The yellow and magenta modules are also overwhelmingly associated with the University
of Nebraska, though interestingly our algorithm separates these two modules at this partitioning. In general, one does 
not anticipate that the optimal number of clusters in a given network will give the most natural partitioning at all 
scales and over all resulting modules. 



6.2. Biological Network 

The notion of modularity has been central in the study of a variety of biological networks including metabolic [13],
protein [?, 22], and genetic [?, ?] networks. Certainly most biologists agree that the various networks operating
within and between cells have a modular structure, though what they mean by "modular" can vary greatly [10].

NIB allows us to investigate quantitatively and in detail to what extent naturally-occurring biological networks are 
modular. For example, Figure 7 depicts the undirected form of the largest component of the E. coli genetic regulatory
network described previously in [?] and [?]. The network consists of 328 vertices and 456 edges and its modularity is
depicted by the curve one traces in the information plane as the network is clustered using the network information
bottleneck (see Figure 3(d)).

To establish whether the modularity of the network should be considered low, high, or moderate, we employ an 
ansatz popular in several research communities in which a distribution of networks is created by holding the in-,
out-, and self-degree of each node constant but randomizing the connectivity of the graph, changing which nodes are
connected to which neighbors [?, ?, ?, 24, 25, 26]. The randomization, a variant of the configuration model [?],
produces a distribution of networks from which we sample and then measure the network modularity. The histogram
in Figure [?] shows that E. coli's modularity is high relative to this ensemble.
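
A common way to sample such a degree-preserving ensemble is repeated edge switching, sketched below for the simplified case of an undirected simple graph: two edges (a, b) and (c, d) are replaced by (a, d) and (c, b) whenever this creates neither a self-loop nor a duplicate edge. This is an illustrative assumption and not necessarily the exact procedure used here, which holds in-, out-, and self-degrees fixed. The function name, the number of swaps, and the ring example are hypothetical.

```python
import random


def randomize_preserving_degrees(edges, n_swaps, seed=0):
    """Shuffle an undirected simple graph by edge switching.

    Every accepted swap replaces (a, b), (c, d) by (a, d), (c, b), which
    leaves the degree of every node unchanged.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:  # pick one of the two possible rewirings
            c, d = d, c
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def degrees(edges, n):
    """Degree of each of the n nodes."""
    k = [0] * n
    for u, v in edges:
        k[u] += 1
        k[v] += 1
    return k


# Hypothetical example: a ring of 8 nodes, every node of degree 2.
ring = [(i, (i + 1) % 8) for i in range(8)]
shuffled = randomize_preserving_degrees(ring, n_swaps=20)
print(degrees(shuffled, 8))
```

The network modularity of each sampled graph can then be compared with that of the original network.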

FIG. 3: (a) APS Network Modularity. Information plane for the collaboration network obtained from the 2004 APS March
Meeting (largest component consists of 5603 authors and 19761 edges). We use the agglomerative algorithm with the diffusion
operator $e^{-L T^{-1} t}$. Network modularity for this graph, defined as the area under the curve, is 0.9775. Comparison is made with
a typical information curve obtained from an Erdős graph. The optimal number of modules as defined by the Newman and
Girvan measure is at |Z| = 115.

FIG. 4: Adjacency network of the 115 modules of the APS network. Nodes represent modules where the size of the drawn ellipse
is proportional to the number of authors in the module. Edges between modules represent collaborations between authors in
different modules, where the thickness of the drawn lines is proportional to the number of these inter-module collaborations.
The module-network reveals a structure with a highly dense center of modules, three branching linear chains of modules, and
one cycle of modules.

FIG. 5: One of the 115 modules of the APS network. Nodes represent authors and edges represent collaborations. Of the 17
authors in the module, 15 are Columbia University affiliated and two are affiliated with Stony Brook University.



7. CONCLUSIONS AND EXTENSIONS 

We have presented a principled, quantitative, parameter-free, information-theoretic definition of network modularity, 
as well as an algorithm for discovering modules of a network. Network modularity is a dimensionless number between
0 and 1 and is a property of a given network over all scales, rather than of a given partitioning with a given number of
modules. The measure is applicable to any network, including those with weighted edges. We validate the effectiveness 
of our algorithm in identifying the correct modules and in finding the true number of modules on synthetic, Monte 
Carlo generated, Erdos-like, modular networks. Finally, application to two real-world networks, a "social" network of 
physics collaborations and a biological network of gene interactions, is demonstrated. 

Network modularity, the area under the curve in the information plane, is but one relevant statistic that we may 
retrieve from the information curve. Certainly other useful statistics may be culled. For example, the optimal 
information curve will always be concave [14] and its slope will decrease monotonically. The point at which the slope
equals 1 is uniquely determined for each network and can be described as the point after which clustering further
results in a greater loss in relative relevant information than gain in relative compression (that is, $\frac{\delta I(Z, Y)}{I(X, Y)} = \frac{\delta I(Z, X)}{H(X)}$).
This break-even point is the point at which one can gain further (normalized) simplicity only by losing an equivalent
(normalized) fidelity. Numerical experiments investigating the utility of this measure are currently in progress.

FIG. 6: Six modules corresponding to the uppermost branched linear chain of modules depicted in Figure 4. Colors denote
modules as defined by the network information bottleneck algorithm. Again the modules roughly correspond to institutional
affiliations. Over 50% of the blue nodes have one or more affiliations with the institutions based in and around Chicago
(Argonne National Laboratory, University of Illinois at Chicago, and University of Notre Dame). 70% of the red nodes are in
England, 75% of the green nodes are in China, mostly at the Institute of Chemistry Chinese Academy of Sciences, and all
of the cyan nodes are at the Center of Complex Systems Research in Illinois. Both the yellow and magenta modules are mostly
affiliated with the University of Nebraska.



FIG. 7: E. coli gene regulatory network. Largest component of the symmetric version of the E. coli genetic regulatory 
network. Colors denote modules identified by NIB.



Diffusive distributions are but one general class of distributions on a network. A natural generalization of these 
ideas is to describe other distributions on a network for which a particular function, energy, or origin is known, and 
on which some particular degree of freedom (such as chemical concentration or genetic expression as a function of 
time) may be defined. 

Finally, we note that while the information bottleneck is a prescription for finding the highest-fidelity summary of a 
system at a given simplicity, algorithms for determining network community structure are usually motivated by various 
definitions of normalized min-cuts [27, 28, 29, 30]. Our results, particularly for the synthetic graphs with prescribed
modular structure, demonstrate that information modularity implies edge modularity, an unexpected finding which
motivates further numerical and analytic investigations in progress regarding this relationship.